Analysis of speech behaviours between genders.

Do speech behaviours related to confidence and uncertainty vary between men and women?


Context

Among all species on Earth, humans are unique in communicating through a symbolic system, i.e., spoken and written language [1]. This highly sophisticated system lets humans communicate in a very precise and complex manner. Still, communicative speech acts seem to differ between genders. One of the major reported differences between women’s and men’s speech is that men have been found to dominate conversations through interruptions and overlaps [2]. Additionally, men tend to use strong expletives, while women use more polite versions.

In this project, we investigate how speech varies with gender and how social norms shape the use of language across genders. We hypothesise that men and women have different speech behaviours, with women expressing more uncertainty (doubt). For example, we would expect a woman to say “I expect this to do that” where a man would rather say “I know this does that”. Our aim is therefore to analyse whether there is a real difference between genders and, if so, how large it is.


Goals

We are interested in using the Quotebank dataset to answer the following question:

"Do speech behaviours related to confidence and uncertainty vary between men and women?"

To answer this question, we'll go through the following points:

Profession gender difference

To what extent can we observe the differences in communicative acts in relation to gender within a professional area?

Culture gender difference

What are the roles of nationality, culture and education in determining those differences in speech between men and women?

Temporal gender difference

Have gender differences in speech changed between 2015 and 2020?


What is our data?

In the following, we analyse data from Quotebank, an open corpus of 178 million speaker-attributed quotations from 2008 to 2020. In this project, we focus only on the most recent quotations, from 2015 to 2020. We combine this dataset with speaker information from Wikidata, a collaboratively edited, open-source knowledge base.


Methods

Creation of professional & background data frames

To get a general overview of the speakers’ occupations, we focus on four main professional fields: arts, science, economy and politics. Speakers are grouped by profession within each field. Then, to assess the roles of nationality, religion and education in a possible cultural gender difference in communicative acts, we built a general data frame with no condition on profession.
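As a rough illustration of this grouping step, the sketch below maps individual occupations to the four fields. The occupation labels, the `FIELD_OF_OCCUPATION` mapping and the speaker records are hypothetical placeholders, not the lists actually used in the project.

```python
from collections import defaultdict

# Hypothetical mapping from occupation label to professional field.
# The project's real occupation lists are not given here.
FIELD_OF_OCCUPATION = {
    "painter": "arts",
    "actor": "arts",
    "physicist": "science",
    "biologist": "science",
    "economist": "economy",
    "banker": "economy",
    "politician": "politics",
    "diplomat": "politics",
}


def group_speakers_by_field(speakers):
    """Return {field: [speaker names]} for speakers with a mapped occupation.

    A speaker with occupations in several fields is listed once per field.
    """
    groups = defaultdict(list)
    for speaker in speakers:
        fields = {
            FIELD_OF_OCCUPATION[occupation]
            for occupation in speaker["occupations"]
            if occupation in FIELD_OF_OCCUPATION
        }
        for field in sorted(fields):
            groups[field].append(speaker["name"])
    return dict(groups)


# Toy speaker records (assumed structure, for illustration only).
speakers = [
    {"name": "A", "occupations": ["actor", "politician"]},
    {"name": "B", "occupations": ["physicist"]},
    {"name": "C", "occupations": ["farmer"]},  # unmapped: left out of every field
]
fields = group_speakers_by_field(speakers)
```

Speakers whose occupations fall outside the four fields are simply dropped from the professional data frames here; they would still belong to the general data frame with no condition on profession.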

Classifier

To analyse speech uncertainty, we adapted an existing uncertainty detection classifier [3], using 6 features. Uncertainty is signalled by speculative verbs (like suggest or presume), adjectives and adverbs (like probably or possibly), modal auxiliary verbs (like must or should), or the use of certain tenses or moods (subjunctive, conditional). The classifier is a machine learning method that automatically detects uncertainty in natural language.
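To make the cue categories above more concrete, here is a toy keyword matcher. It is not the classifier used in this project: it is a minimal sketch that only checks for a handful of cue words. The cue lists contain the examples from the text plus a few assumed additions ("expect", "might", "could"), and the function names are made up for illustration.

```python
import re

# Illustrative cue lists: examples from the text plus a few assumed additions.
# This is NOT the machine learning classifier used in the project.
UNCERTAINTY_CUES = {
    "speculative_verb": {"suggest", "suggests", "presume", "presumes", "expect", "expects"},
    "adverb_or_adjective": {"probably", "possibly", "possible"},
    "modal_auxiliary": {"must", "should", "might", "could"},
}


def find_uncertainty_cues(quote):
    """Return the sorted cue categories whose words appear in the quote."""
    tokens = set(re.findall(r"[a-z']+", quote.lower()))
    return sorted(
        category
        for category, words in UNCERTAINTY_CUES.items()
        if tokens & words
    )


def is_uncertain(quote):
    """Flag a quote as uncertain if it contains at least one cue word."""
    return bool(find_uncertainty_cues(quote))


uncertain_example = is_uncertain("I expect this to do that")
certain_example = is_uncertain("I know this does that")
```

A word list like this cannot handle tenses, moods or context, which is one reason a trained classifier is used instead.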


Let's explore our data!


Before starting to investigate our research questions, let's have a look at what our dataset looks like.

Speakers gender exploration

We see that there are 32 genders present in Wikidata. In this analysis, we will only focus on 2 genders: “Female” and “Male”.

Quotes languages exploration

We notice that the majority of the quotes are in English. Still, there are some non-English quotes in the dataset. We remove these, limiting our analysis to English quotes.

Speakers female & male ratio

As seen from the data, a large majority of speakers are male, which points to a persistent under-representation of women in the news. Unfortunately, this was expected: even today, in the early twenty-first century, women continue to receive substantially less media coverage than men, despite their much increased participation in public life. However, only a few studies have systematically examined whether such media bias exists.

Still, a recent study in the American Sociological Review [4] found that societal-level inequalities are the dominant determinants of continued gender differences in coverage: the media focuses nearly exclusively on the highest strata of occupational and social hierarchies, in which women’s representation has remained poor. As a result, we will focus our analysis on whether speech uncertainty differs between genders within different professional areas, and whether any such differences have changed from 2015 to 2020. Additionally, to broaden our analysis, we also examine the speakers’ backgrounds to find other possible correlations between the speakers’ environments and their speech behaviours.

Let's check how well women are represented in the four professional fields of interest.

The female ratio for the different occupation groups

Politicians: 21%
Artists: 34%
Scientists: 25%
Economists: 18%
All occupations: 20%
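For reference, a ratio like the ones above can be computed as the number of female speakers divided by the number of female and male speakers in a group. The sketch below assumes one gender label per speaker; the `female_ratio` helper and the toy labels are illustrative and are not the project's data or code.

```python
from collections import Counter


def female_ratio(genders):
    """Share of female speakers among speakers labelled female or male."""
    counts = Counter(g for g in genders if g in ("female", "male"))
    total = counts["female"] + counts["male"]
    if total == 0:
        raise ValueError("no female or male speakers in this group")
    return counts["female"] / total


# Toy data: one gender label per speaker in a hypothetical occupation group.
# Labels other than "female" and "male" are ignored, as in the analysis.
group = ["male", "male", "female", "male", "non-binary", "male"]
ratio = female_ratio(group)
print(f"{ratio:.0%}")  # prints 20%
```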


Results

Analysis of differences in communicative acts in relation to gender within the different professional areas


Looking at the figure above, there seem to be only small differences in speech uncertainty between men and women when all quotes (2015 to 2020) are pooled together. Intriguingly, women even seem slightly less uncertain than men.

Analysis of speaker's background influence on certainty

What are the roles of the environment (nationality), culture/tradition (religion), and education (whether the speaker obtained a specific academic degree) in determining those differences in speech between men and women? How are the lines drawn between the language we use and the environment around us?


Let's have a look at nationality:

[Interactive map: Difference of uncertainty between Females and Males over years (2015 to 2020), with a year selector from 2015 to 2020 and a color scale from −0.2 to 0.4]

The colors range from white, where women account for the highest number of uncertain speakers in a country, to dark red, where men do.
Using the year slider, we can follow how the gender difference in uncertainty is distributed over the years. It seems that female speakers tend to become more uncertain as the years go by.

Let's have a look at religion:

Let's have a look at academic degree:


Looking at the figures above, there seems to be a consistent distribution of 70% certain and 30% uncertain speakers in each condition. Still, there seems that … .

Has there been a possible change over time from 2015 to 2020?



Statistical analysis

We performed a linear regression on the certainty labels obtained from the Pajean uncertainty classifier to identify which features are important for predicting certainty. It is important to keep in mind that the classifier has an F-score of 62.8, so some of the results we obtain could be due to chance.
We considered features with a p-value below 0.05 statistically significant.
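To illustrate the significance criterion, here is a stdlib-only sketch of a regression with a single predictor. It is not the model fitted in this project, which involved several features. The `simple_ols` helper is hypothetical, the labels are toy data, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean


def simple_ols(x, y):
    """Fit y = a + b*x by least squares and return (slope, p_value).

    The two-sided p-value for the slope uses a normal approximation to the
    t distribution, so it should only be trusted for large samples.
    Assumes x is not constant and the fit is not perfect (nonzero residuals).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(residual_ss / (n - 2) / sxx)
    z = slope / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, p_value


# Toy labels (not project data): 100 male and 100 female speakers,
# with 1 = quote classified as certain and 0 = uncertain.
is_female = [0] * 100 + [1] * 100
is_certain = [1] * 70 + [0] * 30 + [1] * 72 + [0] * 28

slope, p_value = simple_ols(is_female, is_certain)
significant = p_value < 0.05  # the threshold stated above
```

With a binary predictor, the slope is simply the difference in the share of certain quotes between the two groups. In this toy example the difference is too small to pass the 0.05 threshold.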


Conclusion

Through this notebook, we aimed to analyze the speech difference between women and men using the Quotebank dataset. We started from the hypothesis that women speak less confidently than men and in a more uncertain way. To test this claim, we conducted an analysis with the help of a classifier which distinguishes uncertain quotations from certain quotations. We also used Wikidata as supplementary input data to study the speakers more closely.
In our first analysis, which considered all speakers indiscriminately, we found similar certainty levels for men and women, at approximately 70%.
Then, we split the data frame by `occupation`, `religion`, `nationality` and `education` to measure the impact of each factor and to remove bias. For our initial question, there seems to be no significant difference between men and women when they are compared within the same field of work. However, we noted a markedly higher overall certainty probability of 90% for this population, 20 percentage points more than for the full population.

We continued our research by analyzing the backgrounds of the speakers, such as nationality, religion, and academic degree. Again, interactions between background and gender were rarely significant. This could be due to the strong gender imbalance of the Quotebank dataset (~ 75% males & 25% females): interactions of a specific background with the female gender were measured on much smaller samples.

It is important to remember that women who are cited in Quotebank may not be representative of all women. They represent a subset of women who have achieved high public prominence and recognition. The imbalance of the Quotebank dataset suggests that this subset is still proportionally smaller among women than among men.
This could be an important bias in our study, which could have tilted our results towards equality between genders.